How do climate variables affect produce prices?¶

Preface: This document is a summary. If requested by the lecturer, it will be used as an outline for an academic term paper.

Motivation¶

With the increasing intensity of climate change {cite}Brown2017, agricultural output and the prices of agricultural commodities are being affected {cite}baiImpactClimateChange2022. While investigation into cereals is widespread, less emphasis has been placed on vegetables and fruits which vary widely in their sensitivity to climate conditions. Turkey, being a leading vegetable and fruit producer and exporter, is vulnerable to volatility in global and local vegetable and fruit output and prices.

Better understanding the relationship between produce prices and climate variables may reveal policy implications for better management of the agricultural sector with respect to the ongoing Climate Transition. This project aims to investigate a single crop as an exploratory exercise.

How Are Wholesale Lemon Prices Affected by Meteorological Events and Climate?¶

This study will attempt to describe the relevance and magnitude of various climate factors on wholesale lemon prices. Auto-Regressions and Time Series Analysis will be utilized for this purpose.

Data¶

Four sources of data are used for this study:

  1. Istanbul Metropolitan Municipality Wholesale Market Prices
  2. Meteostat Weather Station Service (via meteostat python package)
  3. TURKSTAT Annual Fruit and Vegetable Production
  4. TURKSTAT Food Price Indices

Istanbul Metropolitan Municipality Wholesale Market Prices¶

Data discovered and downloaded from REST API. General characteristics:

  • 989,748 rows x 8 columns (id, name, measurement unit, category, daily low price, daily high price, market type, date)
  • Includes fish markets as well as produce markets.
  • Starting from 2003, quality of data markedly improves in 2005.

Meteostat Weather Station¶

Data was pulled from meteostat's python library. Initially three stations with sufficient data were selected, then reduced to one after review.

  • Weather station data consists of temperature data (daily min, max, average), precipitation in milimeters. Windspeed and direction was ignored, sunshine duration was not available.
  • Weather station is 17340
  • Data has a considerable gap between 2010 and 2014, analysis was restricted to post-2014

TURKSTAT Food Price Indices¶

Data was pulled from Turkish Central Bank Electronic Data Distribution System (EVDS)

Cleaning and Transformations¶

  1. Scraped market price data, originally saved as a set of json files, was parsed into a singular dataframe. See parse_halprices.ipynb notebook.
  2. Some preliminary cleaning was done to get rid of obvious clerical errors and inconsistencies in market price data. See clean_halprices.ipynb notebook.
    • Pre-2005 prices were converted to new Turkish liras.
    • Clerical errors where the reported daily low-price was higher than the high-price were fixed on a case-by-case basis.
    • Missing daily high-price data assumed to be equal to daily low-price.
    • Some obvious typos were corrected.
  3. After some review, Lemons were picked as the product to be investigated for the following reasons:
    • Data availability: Price series is consistent and not sparse.
    • Large volumes: Lemons are year-round staple both for households and industry.
    • Climate Sensitive: Lemons are not grown in greenhouses, are not especially hardy. They're effected by cold winters, heavy or low precipitation and heat waves.
    • Inflexible supply: Growing new groves takes years, so production area does not vary wildly.
    • Not trivially cheap.
  4. Downloaded weather station data was aggregated by months. See get_meteostation.ipynb notebook.
    • Daily temperature (min, avg, max) data was aggregated to monthly values: minimum, 10th percentile, average, 90th percentile, maximum and minimum
    • Monthly temperature metrics were added: days with minimum temperature below 5°C and days with minimum temperature below 0°C
    • Precipitation data was aggregated to monthly values: Total precipitation in milimeters, maximum precipitation in a day
  5. All of the cleaned data was aggregated into single dataframe. See agg_halprices.ipynb notebook.
    • Price data was filtered to only include Lemon series, then aggregated to monthly values: minimum low-price, maximum high-price, min-max difference, average daily low-price, average daily high-price
    • Price index joined by date.
    • Weather data joined by date.
    • Annual production joined by date.
In [ ]:
# Import libraries
import matplotlib as mpl
import polars as pl
import plotly.express as px
import plotly.io as pio
import statsmodels.api as sm

# Play nice with plotly
pio.renderers.default='notebook'

# Load dataframe
df = pl.read_parquet('../data/transformed/transformed_unified_frame.parquet').sort('date')

Summary Charts and Graphs¶

Dataframe Summary¶
In [ ]:
df.describe()
Out[ ]:
shape: (9, 35)
statisticdateobs_countavglowpriceavghighpriceavgdiffpricelowestpricehighestpricediffpriceproduce_pitmin_lowesttmin_10pcttmin_avgtmin_90pcttmin_highesttavg_lowesttavg_10pcttavg_avgtavg_90pcttavg_highesttmax_lowesttmax_10pcttmax_avgtmax_90pcttmax_highesttmin_days_below_zerotmin_days_below_fiveprcp_max_dayprcp_total_moyearly_productionyearly_cultivated_decareyearly_productivityyearly_production_mersinyearly_cultivated_decare_mersinyearly_productivity_mersin
strstrf64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64f64
"count""92"92.092.092.092.092.092.092.092.092.092.092.092.092.092.092.092.092.092.063.063.063.063.063.092.092.092.092.092.092.092.092.092.092.0
"null_count""0"0.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.00.029.029.029.029.029.00.00.00.00.00.00.00.00.00.00.0
"mean""2017-02-14"20.739132.4552484.436951.9817012.1576095.0326092.8753.65516613.65760915.07967417.45045319.65478320.83478317.70978318.97543521.22934323.30217424.67282621.47777822.5161924.16384625.73238126.7476190.0217390.55434822.04130456.344565920376.347826339103.3913042.727823547418.434783166211.4347833.294948
"std"null1.7152811.3832511.8386650.9637141.270222.2049951.4758591.1222527.6173497.4108796.8114456.3495945.9865317.6418187.1699716.5353546.0812495.7427287.5596227.1050526.5580386.4084966.7373980.146631.70560926.71463486.313634166201.91052265509.7055750.246591105235.15865125352.6812030.409898
"min""2013-05-01"15.00.961.6090910.50.81.70.52.150897-0.40.975.57.989.03.65.229.6206912.113.95.98.399.29.29.20.00.00.00.0725230.0274252.02.365862407401.0135634.02.585384
"25%""2015-04-01"20.01.53.01.3181821.03.02.02.7900017.18.7211.30833313.8715.211.613.215.66428617.820.115.917.6419.04333320.321.20.00.04.36.8750550.0285696.02.532251450878.0145152.03.014231
"50%""2017-03-01"21.02.04.01.952.05.03.03.36918313.315.0818.30689720.1221.518.119.121.86774224.225.321.221.8623.926.4326.70.00.011.919.1950000.0324279.02.627093587392.0163719.03.359088
"75%""2019-01-01"22.03.05.4545452.3571433.07.04.04.54293820.521.523.57241426.127.124.325.0926.96333329.2130.027.828.629.60689731.533.60.00.032.068.51.1e6401545.03.063111656440.0193567.03.749553
"max""2020-12-01"26.07.759.55.07.010.07.06.08979125.626.527.2528.529.429.930.230.99354831.8533.433.333.734.54838735.437.41.08.0126.0504.11.188517e6469352.03.105761701747.0208910.03.831425
In [ ]:
px.bar(df, x='date', y='obs_count', title='Observed Monthly Price Points')
In [ ]:
px.line(df, x='date', y=['avglowprice', 'avghighprice', 'lowestprice', 'highestprice'], title='Lemon Price Series')
In [ ]:
df_real = df.with_columns(
    pl.col('avglowprice') / pl.col('produce_pi'),
    pl.col('avghighprice') / pl.col('produce_pi'),
    pl.col('avgdiffprice') / pl.col('produce_pi'),
    pl.col('lowestprice') / pl.col('produce_pi'),
    pl.col('highestprice') / pl.col('produce_pi'),
    pl.col('diffprice') / pl.col('produce_pi'),
)

px.line(df_real, x='date', y=['avglowprice', 'avghighprice', 'lowestprice', 'highestprice'], title='Lemon Price Series (Corrected for Food Inflation)')
In [ ]:
px.line(df, x='date', y=['tmin_lowest', 'tmin_10pct', 'tavg_avg', 'tmax_90pct', 'tmax_highest'], title='Monthly Temperature')
In [ ]:
px.bar(df, x='date', y=['prcp_total_mo', 'prcp_max_day'], title='Monthly Precipitation', barmode='group')

Analysis¶

Here's our gameplan:

  1. We will start by doing a seasonality decomposition and de-seasonalizing our prices and average temperature series.
  2. We will check if our data is stationary using the Augmented Dickey-Fuller test.
  3. We will try to estimate the lag length using the AIC and BIC estimators.
  4. We will then construct three models:
    • $\mathbf{AR(p)} \Rightarrow a(L)Y_t = \beta_{0} + u_t$
    • $\mathbf{ADL(p,q)} \Rightarrow a(L)Y_t = \beta_{0} + c(L)X_{t-1} + u_t$ where $q=0$
    • $\mathbf{ADL(p,q)}$ where $q=2$

where
$L^j Y_t = Y_{t-j}$ and $a(L) = \sum\limits_{j=0}^{p} a_jL^j$
$L^j X_t = X_{t-j}$ and $c(L) = \sum\limits_{j=0}^{q} c_jL^j$

To be on the safe side, we will use Heteroskedasticty and Autocorellation-Consistent (HAC) Errors.

TODO: Check for "breaks"

In [ ]:
# Let's start with a seasonal decomposition on our prices, using moving averages.
seasonality_analysis = sm.tsa.seasonal_decompose(
    df_real.select(['avglowprice', 'avghighprice', 'avgdiffprice', 'lowestprice', 'highestprice']),
    period=12
)

trend = seasonality_analysis.trend
seasonal = seasonality_analysis.seasonal
residual = seasonality_analysis.resid

with mpl.rc_context():
    mpl.rc("figure", figsize=(16,8))
    fig = seasonality_analysis.plot()
No description has been provided for this image

TODO: Try using differences to de-seasonalize.

In [ ]:
# Copy the dataframe and de-seasonalize using decomposition
df_seasonadj = df_real.__copy__()
for i, col_name in enumerate(['avglowprice', 'avghighprice', 'avgdiffprice', 'lowestprice', 'highestprice']):
    seasonality_series = pl.Series(seasonal[:, i])
    df_seasonadj = df_seasonadj.with_columns(pl.col(col_name) - seasonality_series)

# Repeat for average temperature.
temperature_sa = sm.tsa.seasonal_decompose(df_real.select('tavg_avg'), period=12)
df_seasonadj = df_seasonadj.with_columns(pl.col('tavg_avg') - temperature_sa.seasonal[0])
In [ ]:
# Plot Autocorrelation
with mpl.rc_context():
    mpl.rc("figure", figsize=(16,3))
    fig = sm.tsa.graphics.plot_acf(df_seasonadj['avglowprice'].to_numpy())
No description has been provided for this image
In [ ]:
# Plot Partial Autocorrelation
with mpl.rc_context():
    mpl.rc("figure", figsize=(16,3))
    fig = sm.tsa.graphics.plot_pacf(df_seasonadj['avglowprice'].to_numpy())
No description has been provided for this image
In [ ]:
def print_adfuller_results(test_results: tuple) -> None:
    tstat = test_results[0]
    cv = test_results[4]
    msg = "With t-value >10%, null hypothesis of non-stationarity can't be rejected"
    if tstat < cv['1%']:
        msg = "Test statistic is lower than 1% t-value. It is reasonable to reject null hypothesis of non-stationarity."
    elif tstat < cv['5%']:
        msg = "Test statistic is lower than 5% t-value. It is possible to reject null hypothesis of non-stationarity."
    elif tstat < cv['10%']:
        msg = "Test statistic is lower than 10% t-value. It is possible but questionable to reject null hypothesis of non-stationarity."

    print(
f"""Augmented Dickey-Fuller Test Results
    Test statistic: {tstat}
    P-value: {test_results[1]}
    # of lags used: {test_results[2]}
    # of observations: {test_results[3]}
    Critical Values:
        1%:  {cv['1%']}
        5%:  {cv['5%']}
        10%: {cv['10%']}
    {msg}
"""
    )

adf_res = sm.tsa.stattools.adfuller(df_real['avglowprice'], regression='ct', autolag='BIC')
print_adfuller_results(adf_res)
Augmented Dickey-Fuller Test Results
    Test statistic: -4.80844940347214
    P-value: 0.00045269732654970124
    # of lags used: 1
    # of observations: 90
    Critical Values:
        1%:  -4.0630536556927295
        5%:  -3.4604500192043894
        10%: -3.156294156378601
    Test statistic is lower than 1% t-value. It is reasonable to reject null hypothesis of non-stationarity.

In [ ]:
# Let's start with a very simple univariate Auto-Regression with no exogenous predictors
model_1 = sm.tsa.AutoReg(
    df_seasonadj['avglowprice'].to_numpy(),
    lags=4,
    trend='n',
    seasonal=False
).fit(
    cov_type='HAC',
    cov_kwds={'maxlags': 10}
)
model_1.summary()
Out[ ]:
AutoReg Model Results
Dep. Variable: y No. Observations: 92
Model: AutoReg(4) Log Likelihood 44.418
Method: Conditional MLE S.D. of innovations 0.146
Date: Wed, 22 May 2024 AIC -78.836
Time: 05:17:00 BIC -66.449
Sample: 4 HQIC -73.846
92
coef std err z P>|z| [0.025 0.975]
y.L1 0.7873 0.125 6.314 0.000 0.543 1.032
y.L2 0.0324 0.099 0.328 0.743 -0.161 0.226
y.L3 0.1436 0.141 1.015 0.310 -0.134 0.421
y.L4 0.0231 0.078 0.296 0.767 -0.130 0.176
Roots
Real Imaginary Modulus Frequency
AR.1 1.0098 -0.0000j 1.0098 -0.0000
AR.2 -0.1790 -2.4919j 2.4983 -0.2614
AR.3 -0.1790 +2.4919j 2.4983 0.2614
AR.4 -6.8677 -0.0000j 6.8677 -0.5000

TODO: What does L2, L3, L4 being insignificant really mean?

In [ ]:
# And now let's add our exogenous predictors:
model_2 = sm.tsa.AutoReg(
    df_seasonadj['avglowprice'].to_numpy(),
    lags=1,
    trend='n',
    exog=df_seasonadj.select('tavg_avg', 'prcp_max_day', 'prcp_total_mo', 'tmin_days_below_five').to_numpy()
).fit(
    cov_type='HAC',
    cov_kwds={'maxlags': 10}
)
model_2.summary()
Out[ ]:
AutoReg Model Results
Dep. Variable: y No. Observations: 92
Model: AutoReg-X(1) Log Likelihood 44.507
Method: Conditional MLE S.D. of innovations 0.148
Date: Wed, 22 May 2024 AIC -77.014
Time: 05:17:00 BIC -61.949
Sample: 1 HQIC -70.936
92
coef std err z P>|z| [0.025 0.975]
y.L1 0.7894 0.068 11.534 0.000 0.655 0.924
x1 0.0054 0.002 2.464 0.014 0.001 0.010
x2 0.0006 0.001 0.784 0.433 -0.001 0.002
x3 2.71e-05 0.000 0.100 0.920 -0.001 0.001
x4 0.0140 0.006 2.231 0.026 0.002 0.026
Roots
Real Imaginary Modulus Frequency
AR.1 1.2668 +0.0000j 1.2668 0.0000

TODO: Talk more

Lots to talk about here.

x4 (tmin_days_below_five) edges into 5% significance. Out of the other predictors, x1 (tavg_avg) also has 5% significance.

In [ ]:
# Let's also try lagging the predictors with an ADL model:
model_2 = sm.tsa.ARDL(
    df_seasonadj['avglowprice'].to_numpy(),
    lags=1,
    trend='n',
    exog=df_seasonadj.select('tavg_avg', 'prcp_max_day', 'prcp_total_mo', 'tmin_days_below_five').to_numpy(),
    order=1,
).fit(
    cov_type='HAC',
    cov_kwds={'maxlags': 10}
)
model_2.summary()
Out[ ]:
ARDL Model Results
Dep. Variable: y No. Observations: 92
Model: ARDL(1, 1, 1, 1, 1) Log Likelihood 44.946
Method: Conditional MLE S.D. of innovations 0.148
Date: Wed, 22 May 2024 AIC -69.891
Time: 05:17:00 BIC -44.782
Sample: 1 HQIC -59.761
92
coef std err z P>|z| [0.025 0.975]
y.L1 0.7812 0.068 11.421 0.000 0.645 0.917
x0.L0 0.0024 0.006 0.387 0.700 -0.010 0.015
x0.L1 0.0033 0.005 0.720 0.473 -0.006 0.013
x1.L0 0.0004 0.001 0.510 0.611 -0.001 0.002
x1.L1 -0.0004 0.001 -0.564 0.575 -0.002 0.001
x2.L0 1.728e-05 0.000 0.066 0.948 -0.001 0.001
x2.L1 0.0002 0.000 0.995 0.323 -0.000 0.001
x3.L0 0.0104 0.009 1.163 0.248 -0.007 0.028
x3.L1 0.0053 0.004 1.388 0.169 -0.002 0.013

Well that sucks.

{bibliography}